arXiv:1508.00230v1 [cs.NI] 2 Aug 2015


Toward a Robust Sparse Data Representation for Wireless Sensor Networks


Mohammad Abu Alsheikh*†, Shaowei Lin†, Hwee-Pink Tan‡, and Dusit Niyato*

* School of Computer Engineering, Nanyang Technological University, Singapore 639798
† Sense and Sense-abilities Programme, Institute for Infocomm Research, Singapore 138632
‡ School of Information Systems, Singapore Management University, Singapore 188065


Abstract —Compressive sensing has been successfully used for 
optimized operations in wireless sensor networks. However, raw 
data collected by sensors may be neither originally sparse nor 
easily transformed into a sparse data representation. This paper 
addresses the problem of transforming source data collected by 
sensor nodes into a sparse representation with a few nonzero 
elements. Our contributions that address three major issues 
include: 1) an effective method that extracts population sparsity 
of the data, 2) a sparsity ratio guarantee scheme, and 3) a 
customized learning algorithm of the sparsifying dictionary. We 
introduce an unsupervised neural network to extract an intrinsic 
sparse coding of the data. The sparse codes are generated at 
the activation of the hidden layer using a sparsity nomination 
constraint and a shrinking mechanism. Our analysis using real 
data samples shows that the proposed method outperforms 
conventional sparsity-inducing methods. 

Index Terms: Sparse coding, compressive sensing, sparse autoencoders,
wireless sensor networks.

I. Introduction

Sparsely-activated data (a few nonzero elements in a sample vector)
may naturally exist for compressive sensing (CS) applications in
wireless sensor networks (WSNs) such as the path reconstruction
problem, indoor localization, and sparse event detection. On the other
hand, a sparse data representation cannot be easily induced in many
other real-world contexts (e.g., in meteorological applications and
environmental data gathering). In particular, noise patterns are
usually present in data collected from WSNs, which greatly affects the
performance of conventional sparsity-inducing (transformation)
algorithms such as the Haar wavelet and discrete cosine transforms.
This motivates the quest for noise-robust and effective
sparsity-inducing methods for WSNs.

One of the breakthroughs in recent deep learning paradigms for finding
high level data abstractions is achieved by introducing sparsity
constraints on data representations, e.g., the Kullback-Leibler
divergence, rectifier function [6], and topographic coding. These
methods are introduced for extracting intrinsic features from the data
in a similar way that the human brain does while encoding sensory
organ data, e.g., the low percentage of spikes in a visual cortex. In
particular, sparse deep learning methods generate sparse
representations across training data for each single unit (i.e.,
lifetime sparsity), and they neither guarantee sparsity for each input
signal nor bound the number of nonzero values in the sparse codes.
However, a practical CS implementation in WSNs requires a sparse
representation for each input signal (i.e., population sparsity) with
a sparsity ratio guarantee. Specifically, the CS solution to the
underdetermined system (more unknowns than equations) depends on the
sparsity ratio of the signal, and the sparsity-inducing mechanism must
assert an upper limit for the sparsity ratio. This sparsity bounding
is necessary in WSNs as it enables using only one flat acquisition
matrix for data encoding in the node. Therefore, it reduces the CS
overhead in terms of the memory needed to store many measurement
matrices in the transmitting node and the control data exchange, as
there is no need to send out rate control messages.
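
The difference between lifetime and population sparsity can be made concrete with a small sketch. The code matrix below is invented for illustration (it is not data from this paper), with one row per input signal and one column per hidden unit, and the helper names are hypothetical. Every unit is active on only a few inputs, so the codes look sparse in the lifetime sense, yet the last code is fully dense and would break a per-signal bound.

```python
# Toy code matrix: rows are input signals, columns are hidden units.
# The values are invented for illustration only.
codes = [
    [0.9, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.7, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.0],
    [0.8, 0.6, 0.4, 0.3],  # a dense code: violates population sparsity
]


def population_nonzeros(codes):
    """Number of nonzero entries in each code (one count per input signal)."""
    return [sum(1 for v in row if v != 0.0) for row in codes]


def lifetime_nonzeros(codes):
    """Number of inputs on which each hidden unit is active (one count per unit)."""
    n_units = len(codes[0])
    return [sum(1 for row in codes if row[j] != 0.0) for j in range(n_units)]


def satisfies_bound(codes, k):
    """True if every single code has at most k nonzero entries."""
    return all(n <= k for n in population_nonzeros(codes))


per_signal = population_nonzeros(codes)
per_unit = lifetime_nonzeros(codes)
bounded = satisfies_bound(codes, 2)
```

Here each unit fires on at most two of the six inputs, while the per-signal check fails because of the last row. A guarantee of the second kind is what a single fixed sensing matrix relies on.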

The main contributions of this paper are threefold.

1) This paper introduces an effective, population sparsity-inducing
algorithm with a sparsity ratio guarantee. The algorithm is based on
a customized unsupervised neural network model of three layers
(also called an autoencoder network) that generates the required
sparse coding at the second (hidden) layer. In the proposed
shrinking sparse autoencoder (SSAE), the sparsity is achieved by
introducing a regularization term to the cost function of the basic
autoencoder.

2) We customize the learning algorithm to meet WSN characteristics.
For example, the activations of the hidden layer during the
parameter learning stage are rounded to only three decimal places.
The rounding accounts for the low precision computations of sensor
nodes, and it reduces the compressed data size and data
transmission load.

3) We present a customized learning method that optimizes the SSAE
cost function. Basically, the back propagated error is only used to
update the nonzero and active neurons with dominant output values
for each input pattern. Moreover, a shrinking mechanism that
guarantees the sparsity bound is also used during the learning of
the SSAE’s parameters. Therefore, an SSAE bounds the number of
nonzero elements generated at any time instant.

The literature is rich with sparsity-based methods that are designed
for optimized WSN operations. Nonetheless, much less attention is
given to the sparsity-inducing stage, and using straightforward
methods to extract the sparsity basis is common in previous studies,
such as principal component analysis (PCA) [10], discrete cosine
transform (DCT) [11], discrete Fourier transform (DFT), discrete
wavelet transforms, and difference matrices [15]. However, the sparse
coding discipline has seen considerable advances that significantly
enhance sparsity-inducing and hence overall WSN operations. Therefore,
this paper is intended to introduce a robust and more effective
sparsity-inducing method. The proposed method consists of three steps:
(i) data collection, (ii) offline training and modeling, and (iii)
online sparse code generation. An example of the online sparse code
generation for a CS application is shown in Figure 2, which will be
described in detail later.

Fig. 1: Compressive sensing (CS) based data aggregation model: The RoI
is assumed to be relatively far from the BS. Therefore, a gateway is
designed to transmit compressed data over a costly long distance
wireless connection.

The rest of the paper is organized as follows. In Section II, the
problem formulation is presented. Section III describes the proposed
algorithm and the SSAE structure. Then, Section IV discusses important
practical issues of training and fitting the proposed model. In
Section V, numerical results using a real-world data set are
presented. Finally, Section VI summarizes this paper.


II. Problem Formulation 

Consider a dense wireless sensor network consisting of N nodes, as in
Figure 1, that collects data about a region of interest (RoI). Each
sensor i (where i = 1, ..., N) collects a real-valued sample $x_i$
(e.g., temperature measurements) at a predefined sampling period and
transmits packets at a configured transmission power that is not
sufficient to reach the base station (BS) due to long distance
propagation. Therefore, a gateway (GW) is used to collect a data
vector $x \in \mathbb{R}^N$ from all sensor nodes and relay it to the
BS for further analysis and processing. Thereafter, a historical data
matrix $X \in \mathbb{R}^{T \times N}$ is formulated at the BS
containing the collected data vectors as its rows, where T is the
number of collected vectors. Here, the sensors’ oscillators are
assumed to be synchronized to the GW’s clock.

After collecting sufficient historical samples (details of data
collection are elucidated in Section IV-A), and as the GW is energy
and bandwidth constrained, the GW employs CS to spatially compress the
data into a smaller data size. The radio transceiver is the most
energy consuming unit in an ordinary sensor node. Thereby, the energy
consumption becomes more critical in the GW unit as it transmits large
volumes of data over the backhaul connection, while sensor nodes are
assumed to transmit over short distances. It is important to note that
our algorithm can also be applied temporally at each individual sensor
node. However, this introduces data delivery latency, as temporal
samples must be collected at the node before being transmitted as one
compressed chunk. Next, we give an overview of the CS framework and
its implementation at the GW device, and the data reconstruction at
the BS unit.

A. Compressive Sensing (CS) 

CS is a signal processing method for effective data recovery from
fewer data samples than required by the Nyquist rate [17]. Assume a
sparse signal $s \in \mathbb{R}^L$ that has only K nonzero elements;
s is then called a K-sparse signal, and the sparsity ratio $\eta$ is
equal to $K/L$. Moreover, suppose a measurement (or sensing) matrix
$\Phi \in \mathbb{R}^{M \times L}$ that obeys the restricted isometry
property (RIP). Here, M is assumed to be much smaller than L;
therefore, $\Phi$ is a flat matrix with more columns than rows. The
sensing system under consideration that is executed by the GW to
compress data can be expressed as

$$ y = \Phi s \tag{1} $$

where $y \in \mathbb{R}^M$ is the resulting measurement vector. $\Phi$
can be sampled from different distributions to meet the RIP, such as
the Gaussian distribution. Moreover, for high probability recovery, M
must also meet the following constraint [20]:

$$ M \geq \rho K \log\left(\frac{L}{K}\right) \tag{2} $$

where $\rho$ is a constant, and $M \ll L$. At the BS unit, the
reconstruction of s from y can be achieved by minimizing the following
relaxed problem:

$$ s^* = \arg\min_{\|y - \Phi s\|_2 \leq \epsilon} \|s\|_1 \tag{3} $$

where $\epsilon$ is a small constant. The optimization problem (3) can
be solved using a regularized least squares method called the least
absolute shrinkage and selection operator (LASSO).
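
As a minimal sketch of the encoding side only, the snippet below evaluates the measurement bound in (2) and forms a measurement vector as in (1). Several choices here are assumptions for illustration rather than settings taken from this paper: the natural logarithm in the bound, the constant value 1.5, the 1/M variance scaling of the Gaussian entries, and the function names. Recovery via (3) is not shown.

```python
import math
import random


def min_measurements(K, L, rho=1.0):
    """Smallest integer M with M >= rho * K * log(L / K), as in (2).

    The natural logarithm is an assumption made for this sketch.
    """
    return math.ceil(rho * K * math.log(L / K))


def gaussian_matrix(M, L, rng):
    """M x L sensing matrix with i.i.d. zero-mean Gaussian entries.

    The 1/M variance scaling is a common convention, assumed here.
    """
    scale = 1.0 / math.sqrt(M)
    return [[rng.gauss(0.0, scale) for _ in range(L)] for _ in range(M)]


def measure(Phi, s):
    """Compute y = Phi s as in (1)."""
    return [sum(p * v for p, v in zip(row, s)) for row in Phi]


rng = random.Random(0)  # fixed seed, for repeatability only
L, K = 23, 5            # code length and sparsity level (illustrative)

# Build a K-sparse vector with randomly placed nonzero entries.
s = [0.0] * L
for idx in rng.sample(range(L), K):
    s[idx] = rng.uniform(-1.0, 1.0)

M = min_measurements(K, L, rho=1.5)  # rho = 1.5 is an arbitrary choice
Phi = gaussian_matrix(M, L, rng)
y = measure(Phi, s)
```

Because the bound depends on K, a guaranteed upper limit on the number of nonzeros lets the GW fix M, and therefore one sensing matrix, ahead of time.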

B. Sparsity-inducing 

Clearly, the whole CS framework is based on the sparsity assumption.
Natural signals such as sound and images can be transformed into a
sparse form by projecting them onto a suitable basis. However, this is
not the case when dealing with WSN data. More precisely, sensor nodes
produce noisy readings of the form

$$ x = x^* + z \tag{4} $$

where $x^* \in \mathbb{R}^N$ is the noiseless data vector of the
physical phenomenon, and $z \in \mathbb{R}^N$ is the added noise
vector. Noise values are assumed to be independent Gaussian variables
with zero mean and variance $\sigma^2$ such that
$z \sim \mathcal{N}(0, \sigma^2 I_N)$.
Therefore, even though neighboring sensors are spatially correlated
and hence their data is compressible, the presence of noise hampers
the accurate approximation of the source signal x using linear
projection methods. In particular, smooth signals are representable
using linear combinations of Fourier bases, and smooth piecewise
signals are linearly representable in wavelet bases. Nonetheless, the
smoothness condition is not guaranteed in sensor data as data samples
are usually affected by noise patterns, and commercial sensors sense
phenomena with finite precision and are not noise robust. For example,
a few noisy readings can destroy the sparsity pattern of DCT
transformed data [12].

Fig. 2: Example of data compression, transmission, and recovery
operations using CS and sparsity-inducing models.

Fig. 3: Illustration of the SSAE structure.
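
The effect of noise on a fixed transform basis can be illustrated with a short sketch. It uses a synthetic signal rather than sensor data: a smooth vector that coincides with a single DCT basis function, to which Gaussian noise with an arbitrarily chosen standard deviation of 0.1 is added as in (4). The unnormalized DCT-II and the tolerance used to count "significant" coefficients are choices made for this example.

```python
import math
import random


def dct2(x):
    """Unnormalized DCT-II: X_k = sum_n x_n * cos(pi / N * (n + 0.5) * k)."""
    N = len(x)
    return [
        sum(x[n] * math.cos(math.pi / N * (n + 0.5) * k) for n in range(N))
        for k in range(N)
    ]


def count_significant(coeffs, tol=1e-6):
    """Count coefficients whose magnitude exceeds tol."""
    return sum(1 for c in coeffs if abs(c) > tol)


N = 32
# A smooth signal that coincides with one DCT basis vector (k = 3).
clean = [math.cos(math.pi / N * (n + 0.5) * 3) for n in range(N)]

rng = random.Random(1)  # fixed seed, for repeatability only
noisy = [v + rng.gauss(0.0, 0.1) for v in clean]

clean_nonzeros = count_significant(dct2(clean))
noisy_nonzeros = count_significant(dct2(noisy))
```

The clean signal needs one coefficient, whereas the noisy version spreads energy over essentially all of them, so a hard sparsity bound cannot be met without discarding information.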

The main aim of any robust sparsity-inducing mechanism is to transform
the source signal $x \in \mathbb{R}^N$ into the sparse signal
$s \in \mathbb{R}^L$. An upper bound guarantee on the sparsity ratio
of the generated signal s is a “must-have” feature in most
sparsity-based applications such as CS. In particular, this guarantee
enables designing low memory and low communication overhead
applications for WSNs, as a single sensing matrix $\Phi$ is used by
the GW unit to compress data. Then, the BS does not require any
information from the GW to recover the reconstructed signal $\hat{x}$
other than the measurement vector y, where $\hat{x}$ is a
reconstruction of the noiseless data vector $x^*$.

An example of the system online operational procedure is shown in
Figure 2, which includes the sparsity-inducing and CS components. The
next section presents the proposed sparsity-inducing mechanism.


III. Shrinking Sparse Autoencoder (SSAE)

In this section, we introduce an autoencoder variant which we call the
shrinking sparse autoencoder (SSAE), as shown in Figure 3. The SSAE
network is specially designed to transform sensory data from its
original domain into a sparse domain. The SSAE structure consists of
three neural (or computational unit) layers. Firstly, an input layer
is connected to the input signal $d \in \mathbb{R}^N$, where N is the
number of sensor nodes in the network. Briefly, d is a sphered version
of the raw sensor data x, where
$d \in \{\mathbb{R}^N \mid -1 < d_i < 1\}$ as described in Section
IV-B. Secondly, a hidden layer is used to generate an intrinsic code
$h \in \mathbb{R}^L$ at its activation. Thirdly, an output layer
includes the same number of neurons as the input layer and generates a
recovery of the input data $\hat{d} \in \mathbb{R}^N$. The layers are
connected to each other using the following formulations:

$$ h = f\left(W^{(1)} d + b^{(1)}\right) \tag{5} $$

$$ \hat{d} = f\left(W^{(2)} s + b^{(2)}\right) \tag{6} $$

where $W^{(1)} \in \mathbb{R}^{L \times N}$ is the weight matrix
connecting the input and hidden layers,
$W^{(2)} \in \mathbb{R}^{N \times L}$ is the weight matrix connecting
the hidden and output layers, and $b^{(1)} \in \mathbb{R}^L$ and
$b^{(2)} \in \mathbb{R}^N$ are the biases of the input and hidden
layers, respectively. Additionally, s is the sparse data
representation that is obtained by applying the shrinking operation as
described in Section III-A. For simplicity, we define $\theta$ to
contain all the SSAE’s parameters such that
$\theta = [W^{(1)}, W^{(2)}, b^{(1)}, b^{(2)}]$. Moreover, $f(\cdot)$
is the non-linear hyperbolic tangent function.

The SSAE’s cost function $\Gamma(\cdot)$ includes two terms as
follows:

$$ \Gamma(\theta; D) = \frac{1}{T}\left( \sum_{i=1}^{T} \frac{1}{2} \left\| \hat{d}^{(i)} - d^{(i)} \right\|^2 + \gamma \sum_{i=1}^{T} \sum_{j=1}^{L} \log\left(1 + \left(h_j^{(i)}\right)^2\right) \right) \tag{7} $$

where $D \in \mathbb{R}^{T \times N}$ is the training matrix of
historical data such that each input vector $d^{(i)}$ is stored in a
row of this matrix, and $h^{(i)}$ is the hidden layer activation of
$d^{(i)}$. Moreover, T is the training set size configured at the
offline training algorithm (the details are given in Section III-B).
As with any other autoencoder, the first term is the average of the
difference between input vectors and their reconstructions at the
output layer. This term is used to encourage the neural network to
reconstruct its input data at the output layer. The second term is
used to encourage sparsity in the generated coding in the hidden
layer. The sparsity penalty $\gamma$ is a hyper-parameter that manages
the weight of each term in the optimization problem. In other words,
using a large value for $\gamma$ results in a highly sparse
representation, but with poor reconstruction capability. Then, the
well-known delta rule can be used to update the SSAE’s weights and
biases as follows [23]:

$$ W^{(q)}_{jk} = W^{(q)}_{jk} - \alpha \frac{\partial \Gamma(\theta; D)}{\partial W^{(q)}_{jk}} \tag{8} $$

$$ b^{(q)}_{j} = b^{(q)}_{j} - \alpha \frac{\partial \Gamma(\theta; D)}{\partial b^{(q)}_{j}} \tag{9} $$

where $\alpha$ is the learning rate, and $q \in \{1, 2\}$ is the layer
number within the SSAE network. These update rules are executed at
each iteration of a gradient descent method. The partial derivative is
given by

$$ \frac{\partial \Gamma(\theta; D)}{\partial W^{(q)}_{jk}} = \frac{1}{T} \sum_{i=1}^{T} \frac{\partial \Gamma(\theta; d^{(i)})}{\partial W^{(q)}_{jk}} \tag{10} $$

where $\Gamma(\theta; d^{(i)})$ is the cost function defined for a
single sample $d^{(i)} \in D$. This means that the overall partial
derivative of (7) is found by averaging the partial derivatives of all
input samples. The second term of (7) only affects the partial
derivative of the hidden layer (q = 2), which is computed as
follows:

where $f'(\cdot) = 1 - (f(\cdot))^2$ is the element-wise derivative of
$f(\cdot)$. Thereby, the SSAE is designed to generate many zeros at
the hidden layer. One can think of a neuron as being active when its
output is not equal to zero, and an inactive neuron does not
participate in forwarding the input data to the output (because it
does not generate signals). To this end, two important issues of the
second term of (7) must be noted as follows:

• The second term minimizes the hidden layer activation, but it still
does not ensure exactly zero activations.

• It does not guarantee the sparsity ratio of the generated codes.

Accordingly, a shrinking mechanism must be applied to the hidden layer
activations before propagating them to the output layer to reconstruct
the input. In particular, one can think of the second term as only
being used as a mechanism of nominating the most promising neurons to
be deactivated by the shrinking mechanism as described in the next
section.

A. Shrinking (Pruning) Scheme

Even though the cost function of the SSAE is designed to generate a
sparse data coding at the hidden layer, it neither guarantees a coding
with population sparsity (sparsity at each input vector) nor bounds
the maximum number of nonzeros for each input. Equally important, it
will most likely generate values close to, but not exactly, zero.
Therefore, we propose a simple shrinking mechanism that can complete
the design cycle. For each input vector, the proposed shrinking
mechanism “zeroes out” the least dominant neurons in the hidden layer,
thereby switching them to the deactivation mode. The least dominant
neurons are the ones with the least effect on the data reconstruction
at the output layer, and hence the minimum activation values that
result from the sparsity restrictions. Therefore, only K active
neurons at the hidden layer forward propagate the input through the
SSAE network, and the remaining L - K neurons are switched off. An
optimized implementation of the shrinking scheme is given by the
pseudo-code in Algorithm 1, where |·| is the absolute value function.

Algorithm 1 Pseudo-code for the shrinking operation of hidden layer’s
neurons.

Input h ∈ R^L: hidden layer activation before shrinking
Input K: maximum nonzero activations
s = h                                ▷ copy operation
for i = 1 to L - K do
    p = 0
    for j = 0 to L - 1 do
        if |s_j| > 0 and (|s_p| = 0 or |s_j| < |s_p|) then
            p = j
        end if
    end for
    s_p = 0                          ▷ zero-out smallest nonzero value
end for
Output s ∈ R^L

B. Offline Training 

The SSAE’s parameter adjustment is done during an offline training
stage. As a resource demanding process, the training must be performed
at the BS unit, and then the SSAE’s parameters ($\theta$) are
disseminated for online data compression at the GW. The learning stage
is run, and the SSAE’s parameters are tuned, using a resourceful BS
with relatively high precision operations. However, the GW is usually
constrained in terms of computational resources and computational
precision (i.e., the machine epsilon value). Therefore, rounding the
activation at the hidden layer is needed during the learning stage to
match the GW’s low precision. Moreover, with rounding, less data is
transmitted from the GW to the BS.

To learn the SSAE’s parameters ($\theta$), we minimize (7) by using a
non-linear quasi-Newton method called the limited-memory
Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) method [24]. However, the
collected historical data $X \in \mathbb{R}^{T \times N}$ must first
be randomly shuffled. This is because sensors’ readings are highly
correlated over time, and non-shuffled data causes the SSAE to
predominantly learn patterns that appear in the training data only.
Therefore, the shuffling step ensures that the training and testing
data sets contain all possible data patterns. Moreover, the cross
validation technique [25] is an effective method for testing the model
generalization capabilities, while benefiting from all available
samples for training. Cross validation divides the training data into
$\psi$ groups (e.g., 10 groups); each time, one group is held out for
testing while the remaining groups are used for model fitting. Then,
the model performance is found by averaging the errors of all cross
validation groups. The offline learning is described in Algorithm 2.

Algorithm 2 The offline training algorithm.

Input X ∈ R^{T×N}: historical sensor data
Input K: maximum nonzero activations
Input γ: sparsity hyper-parameter
Input ψ: number of folds for cross validation
Randomly shuffle X
Divide X into ψ folds X^1, ..., X^ψ
for all X^i, i = 1, ..., ψ do
    for all x ∈ (X \ X^i) do         ▷ hold out X^i for testing
        Sphere input x to get d using (12)
        Append d to D
    end for
    repeat
        for all d ∈ D do
            Forwardly propagate d to compute h using (5)
            Shrink h to get s as in Algorithm 1
            Compute the reconstruction of d using (6)
        end for
        Compute the cost value using (7)
        Compute the gradient vector as in (10)
        Update θ using the L-BFGS method
    until learning converges
    Compute accuracy using X^i
end for
Compute average accuracy of the ψ folds
Output θ

The learning algorithm is computationally intensive for sensor nodes
and must be performed at the BS. Moreover, if the statistical
parameters of the underlying phenomenon change, the offline training
must be re-executed and an updated $\theta$ should be disseminated to
the nodes.
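
The shuffling and fold-splitting steps of Algorithm 2 can be sketched as follows. To keep the example short and self-contained, the SSAE fit is replaced by a deliberately trivial stand-in model (the training mean) scored by RMSE; the helper names, the seed, and the toy data are all hypothetical and only the cross-validation bookkeeping is the point.

```python
import random


def k_fold_splits(samples, n_folds, seed=0):
    """Shuffle the samples, then yield one (train, test) pair per fold."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled samples round-robin into n_folds groups.
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


def cross_validate(samples, n_folds, fit, score):
    """Fit on each training split and average the held-out scores."""
    scores = [
        score(fit(train), test)
        for train, test in k_fold_splits(samples, n_folds)
    ]
    return sum(scores) / len(scores)


def fit_mean(train):
    """Stand-in model: predicts the training mean (not the SSAE)."""
    return sum(train) / len(train)


def rmse(model, test):
    """Root mean square error of the constant prediction on held-out data."""
    return (sum((x - model) ** 2 for x in test) / len(test)) ** 0.5


data = [float(t) for t in range(20)]  # toy time-ordered readings
avg_error = cross_validate(data, n_folds=5, fit=fit_mean, score=rmse)
```

Shuffling before splitting means each fold draws samples from across the whole recording period instead of one contiguous, highly correlated stretch.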

C. Computational Complexity 

The online encoding and decoding of sparse codes are lightweight. In
particular, the GW (or a sensor node) can generate sparse codes by
only using $[W^{(1)}, b^{(1)}]$ as in (5) and Algorithm 1, with an
overall time complexity of $O(L \times N)$. The data recovery is
performed at the BS by using $[W^{(2)}, b^{(2)}]$ as in (6), with a
similar time complexity of $O(L \times N)$.
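
A minimal sketch of this split between the GW and the BS is given below. The weights are random placeholders rather than trained SSAE parameters, the sizes are arbitrary, and applying the three-decimal rounding before the shrinking step is an assumption of this example. Each affine map touches every weight once, which is where the L × N cost comes from.

```python
import math
import random


def affine_tanh(W, b, v):
    """Return tanh(W v + b) element-wise: one multiply-add per weight."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, v)) + bias)
        for row, bias in zip(W, b)
    ]


def shrink(h, K):
    """Keep the K largest-magnitude entries of h and zero the rest."""
    order = sorted(range(len(h)), key=lambda j: -abs(h[j]))
    keep = set(order[:K])
    return [h[j] if j in keep else 0.0 for j in range(len(h))]


def encode(d, W1, b1, K, places=3):
    """GW side: hidden activation as in (5), rounding, then shrinking."""
    h = [round(v, places) for v in affine_tanh(W1, b1, d)]
    return shrink(h, K)


def decode(s, W2, b2):
    """BS side: reconstruction as in (6), using only W2 and b2."""
    return affine_tanh(W2, b2, s)


rng = random.Random(0)  # fixed seed, for repeatability only
N, L, K = 4, 6, 2       # illustrative sizes
W1 = [[rng.uniform(-1, 1) for _ in range(N)] for _ in range(L)]  # L x N
b1 = [0.0] * L
W2 = [[rng.uniform(-1, 1) for _ in range(L)] for _ in range(N)]  # N x L
b2 = [0.0] * N

d = [0.3, -0.2, 0.1, 0.5]  # a sphered input vector with entries in (-1, 1)
s = encode(d, W1, b1, K)
d_hat = decode(s, W2, b2)
```

Since the weights are untrained, the reconstruction is not expected to resemble the input; the sketch only shows which parameters each side needs and that the code never has more than K nonzeros.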

D. Sparse Codes 

For the verification and analysis in the following sections, a
meteorological data set from the Sensorscope project [26] is used. The
data set contains surface temperature samples of 23 sensors. The
learning curve of the SSAE is shown in Figure 4(a).

An important indication of successful SSAE training is ensuring that
hidden neurons are not connected with zero weights to the input layer.
In other words, this ensures that any neuron in the hidden layer will
be active for some input patterns, and hence no “always-off” neuron
exists. This improves the model’s ability to generalize to non-local
data, and hence it performs well on extremely non-linear data, as the
participation of all neurons increases the number of possible code
formulations (i.e., the number of distinct combinations is increased
when having more active neurons). Figure 4(b) shows hidden layer
activations over time. Here, two main desirable properties can be
observed:

1) Population sparsity is achieved, and the maximum number of active
neurons at any time instant is guaranteed by the SSAE network. This
upper bound of nonzeros in a generated sparse code considers the
tradeoff between the recovery error and compression ratio of the
data aggregation model. Therefore, only a single sensing matrix is
needed when using CS to create a measurement vector at the GW node.

2) All neurons participate in the sparse code generation, without any
“always-off” neuron. Moreover, the activation values of the active
neurons are not concentrated around very small values near zero.
This feature cannot be achieved in conventional average activity
ratio sparse autoencoders, such as those based on the
Kullback-Leibler divergence, as they are designed for lifetime
sparsity only.

IV. Discussion and Practical Considerations 

In this section, some practical issues of the SSAE training 
and fitting are discussed. 

A. Data Collection 

A crucial aspect of machine learning-based approaches, such as the
SSAE network, is the training data requirement. A system designer may
have access to a large historical data set that was collected in the
past. This historical data can be used to train the SSAE’s model.
However, this is not the case for new WSN deployments, and the lack of
sensor data hinders the accurate fitting of the SSAE’s parameters
(i.e., $\theta$). Clearly, the SSAE’s model needs to globally
generalize to unseen data samples. In any machine learning method,
having more training data can improve generalization performance, but
having more data is not the only solution [27]. In WSNs, the following
issues must be considered when using an SSAE as a sparsity-inducing
method.

1) It is assumed that sensor nodes are densely deployed and hence
spatially correlated with each other (e.g., as in Figure 5 for the
Sensorscope project’s data). The SSAE learns these spatial
correlations and redundant patterns in the nodes’ collected data.
Therefore, if the underlying phenomenon changes in a way that
alters the nodes’ spatial correlation, then new data collection and
offline model fitting must be performed.

2) The amount of data required to fit the SSAE’s model depends on the
underlying sensed phenomenon, and for more complex correlation
patterns among sensors, more data samples are needed.

Fig. 4: An SSAE which is designed to produce a maximum of 5 nonzero
values at each time instant ($\eta = 0.2$) for 23 sensors (i.e.,
$x \in \mathbb{R}^{23}$). (a) A learning curve that shows the
convergence of the offline learning algorithm, and (b) activation
values of the hidden layer’s neurons.

Fig. 5: Surface temperature readings of 4 neighbor sensors from the
Sensorscope deployment over 1 day (1 sample every 2 minutes). This
shows the spatial correlation among sensors’ measurements, and hence
data compressibility.

Fig. 6: Data sphering and its effects on data by showing histograms
and basic statistical values. (a) Raw data in the range of
[-16.15, 47.91]. (b) Sphered data that is scaled to a new range of
(-1, 1) with a Gaussian-like distribution.

B. Data Sphering

Before using historical sensor data to train the SSAE, a
pre-processing operation is required, namely the data sphering stage.
Data sphering is simply achieved by applying the following operation
on each sensors’ raw input vector $x \in \mathbb{R}^N$:

$$ d = \mathrm{sphere}(x, \sigma) = \frac{\max\left(\min\left((x - \bar{x}), 3\sigma\right), -3\sigma\right)}{3\sigma} \tag{12} $$

where $\sigma$ is the standard deviation of the historical training
matrix X, $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$ is the arithmetic
mean of each input vector, and again $d \in \mathbb{R}^N$ is the
SSAE’s input vector which is the resulting data vector after sphering.
Unlike the standard element-wise standardization, this subtracts the
arithmetic mean of each input vector and not the whole training
matrix’s mean value. The effect of data sphering on training data is
shown in Figure 6. Clearly, the data is transformed into a smoother
Gaussian-like curve with zero mean (other statistical parameters are
also shown). Equally important, the resulting scale of sphered data is
in the (-1, 1) interval, which makes it suitable for the operation of
the hyperbolic tangent function. In particular, the hyperbolic tangent
function generates an output in the range of (-1, 1), and without data
pre-processing to this range, the SSAE cannot produce outputs similar
to input data.

The reverse operation of data sphering is required at the BS to
reconstruct the original raw sensors’ vector
$\hat{x} \in \mathbb{R}^N$ from the SSAE’s output values
$\hat{d} \in \mathbb{R}^N$. The reverse operation is given as

$$ \hat{x} = \mathrm{desphere}(\hat{d}, \bar{x}, \sigma) = 3\sigma \hat{d} + \bar{x}. \tag{13} $$

Here, $\sigma$ is constant for all recovered vectors, and therefore
can be stored at the BS. However, $\bar{x}$ must be sent from the GW
to the BS along with the compressed data. Therefore, the transmitted
data size when using CS is M + 1.

V. Numerical Results

In this section, we evaluate the performance of the SSAE-based
sparsity inducing method.

A. SSAE Tuning

One of the main difficulties of applying neural network-based methods
is the numerical tuning of the network hyper-parameters.
Hyper-parameter setting of autoencoder variants can be facilitated by
searching over a scale of values in the log-domain (e.g., negative
powers of 10), and then the value that minimizes the cross validation
error is selected accordingly [28]. Figure 7 shows the setting of the
sparsity hyper-parameter $\gamma$ for sparsity ratio $\eta = 0.39$.
The sparsity term in (7) can be interpreted as a sparsity nomination
term that is fed to the shrinking mechanism to generate sparse codes.
Therefore, trying different values of $\gamma$ is useful to achieve
maximum signal reconstruction performance.

Fig. 7: Sparsity parameter setting for $\eta = 0.217$. Bars represent
root mean square error (RMSE) values over 10 runs. The maximum
performance is achieved at $\gamma = 0.2$.

Fig. 8: Root mean square error (RMSE) versus sparsity ratio
($\eta = K/L$).

For the SSAE in the next experiments, the following function is used
for the sparsity penalty $\gamma$ settings

$$ \gamma(\eta) = 0.26 - 0.26\eta, \tag{14} $$

which is found by manually fitting the hyper-parameter $\gamma$ for
two values of $\eta$, as described above, and then finding the line
connecting these two manually fitted points.
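
The two-point construction behind (14) can be sketched as follows. The two (sparsity ratio, penalty) pairs used below are hypothetical: they were chosen to lie on the line in (14) and are not the manually tuned points used in the experiments, which are not listed in the text. The log-domain grid is likewise only an example of candidate values.

```python
def line_through(p1, p2):
    """Return (slope, intercept) of the line through two (eta, gamma) points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    return slope, y1 - slope * x1


def gamma_of_eta(eta):
    """Sparsity penalty as a function of the sparsity ratio, Eq. (14)."""
    return 0.26 - 0.26 * eta


# Two hypothetical tuned points placed on (14); not values from the paper.
p1 = (0.25, gamma_of_eta(0.25))
p2 = (0.50, gamma_of_eta(0.50))
slope, intercept = line_through(p1, p2)

# Example log-domain candidate grid for a search (exponents are arbitrary).
grid = [10.0 ** (-e) for e in range(1, 5)]

gamma_for_5_of_23 = gamma_of_eta(5 / 23)
```

For a sparsity ratio of 5/23 (about 0.217), (14) gives a penalty of about 0.203, which is in line with the best value of 0.2 reported in Fig. 7.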

B. Comparing to Benchmarks 

A difference matrix that captures the difference between adjacent and
correlated values has been used as a sparse basis in previous studies.
Consistent with earlier reports, we noted the difference matrix’s poor
performance in sparsifying the data, and hence it is not included in
our comparison analysis.

Figure 8 shows a comparison between the SSAE recovery performance and
other conventional methods including principal component analysis
(PCA), discrete Fourier transform (DFT), discrete cosine transform
(DCT), and dictionary learning (DL). These conventional methods are
chosen for comparison as they are widely used in the CS literature.
Two important observations can be made.

1) Most sparsity inducing algorithms will achieve a relatively similar
recovery error at high values of $\eta$. However, these high
sparsity ratio values (e.g., $\eta > 0.7$) are not typical in
practical applications as the reduction in data size is not
noticeable. Therefore, these values cannot be used for CS
applications as the measurement vector size will be similar to the
source signal size (i.e., $N \approx M$). On the other hand, SSAE
significantly outperforms conventional methods for practical low
sparsity ratios and when the nonzero values in the generated sparse
codes are required to be minimized.

2) Conventional DL methods (e.g., [30]) use $\ell_1$ minimization to
model the raw data as linear combinations of sparse bases. In this
paper, we used the scikit-learn library for testing the dictionary
learning method, in which the coordinate descent method is used to
find the LASSO problem solution. Similar to our algorithm, the
scikit-learn implementation enables setting the required sparsity
ratio by defining the number of nonzero coefficients in the sparse
code, while we set the remaining parameters to their default
values. We normalize the data to zero mean and unit variance before
learning the dictionary model. In addition to the slightly better
performance, we also noticed that the learning time of the SSAE
method is shorter than that of the DL method. This is significant
for large-scale WSNs.

C. Noisy Data 

Sensors may report imprecise measurements due to external noise
sources, inaccurate sensor calibration, unstable power supply, and
imperfect node design [32]. In this section, we assume that noise
values are independent Gaussian variables with zero mean and variance
$\sigma^2$ such that $z \sim \mathcal{N}(0, \sigma^2 I_N)$, where
$z \in \mathbb{R}^N$ is an added noise vector. We noticed that the
SSAE method not only allows the compression of the sensors’ data, but
also helps in estimating the noiseless data vector of the physical
phenomenon $x^* \in \mathbb{R}^N$.

An overcomplete sparse representation is achieved when the number of
hidden layer’s neurons (sparse code’s size) is greater than the number
of input layer’s neurons (i.e., L > N). However, the measurement
vector’s size M of CS is proportional to the sparse code’s size as in
(2). Therefore, the number of nonzero items must be minimized, and
fewer nonzero coefficients are defined in the overcomplete sparse
code. On the other hand, using more neurons in the hidden layer can
result in the overfitting problem [33]. Overfitting degrades the
neural network’s reconstruction performance and increases the learning
time of the parameters
$\theta = [W^{(1)}, W^{(2)}, b^{(1)}, b^{(2)}]$. Table I summarizes
the experiments of using overcomplete sparse representations. The
results also include the case of adding external noise
$z \sim \mathcal{N}(0, \sigma^2 I_N)$ to sensors’ measurements. This
shows that the overcomplete case is useful in unreliable networks to
reduce the noise effects while producing sparse codes. However, in
noise-free networks, using overcomplete codes can degrade the
sparsity-inducing algorithm performance due to the overfitting
problem.

TABLE I: System performance with different numbers of hidden neurons.

L     K    γ      M     RMSE (no external noise)    RMSE (noise σ² = 1)
23    5    0.2    12    0.987                       1.522 (unreliable)
25    5    0.25   12    0.930 (best)                1.512 (unreliable)
30    4    0.5    12    0.982 (overfitting)         1.259 (best)
32    4    0.6    12    1.027 (overfitting)         1.338

VI. Summary 

In this paper, we have introduced a sparsity-inducing algorithm for
data aggregation of non-sparse signals in wireless sensor networks.
The proposed method consists of three steps: data collection, offline
training and modeling, and online sparse code generation. The modeling
scheme is based on a neural network with three layers, where the
sparse codes are exposed at the hidden layer’s neurons. A cost
function is introduced as a sparsity nomination scheme. Then, a
shrinking mechanism is used to switch off the least dominant neurons
in the hidden layer, while bounding the number of generated nonzero
values in the sparse code. The resulting scheme can be used in many
applications such as in compressive sensing-based data aggregation
schemes.

For future research, we will analytically study the energy consumption
and computational burdens of the proposed scheme.